Lengthscales and Cooperativity in DNA Bubble Formation 
It appears that thermally activated DNA bubbles of different sizes play central roles in important
genetic processes. Here we show that the probability for the formation of such bubbles is regulated 
by the number of soft AT pairs in specific regions with lengths which at physiological temperatures 
are of the order of (but not equal to) the size of the bubble. The analysis is based on the Peyrard- 
Bishop-Dauxois model, whose equilibrium statistical properties have been accurately calculated here 
with a transfer integral approach. 



The genetic code underlying all forms of life is encoded 
in the DNA molecule by the four bases guanine (G), 
thymine (T), adenine (A), and cytosine (C) strung along 
a sugar-phosphate backbone in a particular sequence. 
The four bases are, through hydrogen bonding, pair- 
wise complementary (A-T and G-C) allowing the coding 
strand and its complement to form the characteristic
double helical DNA macromolecule. Although this con- 
struct is extraordinarily stable, it is clearly necessary that 
the double strands be separated in biological processes, 
including gene transcription, where the code is read by 
the appropriate protein machinery in the cell. It has 
long been an experimental fact that the DNA double- 
strand can be thermally destabilized locally to form tem- 
porary single stranded "bubbles" in the molecule. This 
local melting is made possible by the entropy gained by 
transitioning from the very rigid double-strand to the 
much more flexible single-strand, which already at biolog- 
ically relevant temperatures can balance the energy cost 
of breaking a few base pairs. Considering this entropic 
effect together with the inherent energetic heterogeneity 
- GC base pairs are 25 % more strongly bound than the 
AT bases - of a DNA sequence, it is conceivable that 
certain regions (subsequences) are more prone to such 
thermal destabilization than others: This has been con- 
firmed by model calculations as well as experiments. We 
have previously argued that such regions may indeed 
experimentally coincide with transcription initiation and 
regulatory sites. In this way, the DNA molecule may help 
initiate its own transcription by containing bubble form- 
ing subsequences at the crucial positions in the sequence 
where the transcription machinery assembles and engages 
its operation. If a robust general link between the forma- 
tion of large thermal bubbles and transcription initiation 
is sufficiently established, it becomes crucially important
to be able to accurately predict the subsequence of DNA
with propensity for the formation of bubbles of appropri- 
ate sizes. 

Here we show that the probabilities of finding bubbles
extending over n sites do not depend on specific DNA
subsequences. Rather, such probabilities depend on the
density of soft A/T base pairs within specific regions of
length L(n). This characteristic length is of the order of
the size n of the bubble at physiological temperatures,
but it diverges as the DNA melting temperature is
approached. Our results are based on a calculation of the
thermal equilibrium statistical properties of the
Peyrard-Bishop-Dauxois (PBD) model using a transfer
integral operator (TIO) technique. This model constitutes a
very powerful tool not only to predict the bubble formation
probability in a given sequence but also to understand the
underlying physical mechanisms. Our previous study
of the PBD model was performed using Langevin
and Monte Carlo techniques. However, since our
interest is centered on a very small portion of the
thermodynamic equilibrium state, namely the formation of
large bubbles, the dynamical and iterative sampling
offered by these methods is not very efficient. Therefore,
we have developed here a semi-analytic approach based
on the TIO that allows us to efficiently calculate
the relevant thermodynamic probabilities.

The potential energy of the PBD model, in its simplest
form, reads

$$E = \sum_{n=1}^{N} \left[ V(y_n) + W(y_n, y_{n-1}) \right], \qquad (1)$$

where $V(y_n) = D_n \left( e^{-a_n y_n} - 1 \right)^2$ represents the
nonlinear hydrogen bonds between the bases, and
$W(y_n, y_{n-1}) = \frac{k}{2} \left( 1 + \rho\, e^{-b (y_n + y_{n-1})} \right) (y_n - y_{n-1})^2$
is the nearest-neighbor coupling that represents the
(nonlinear) stacking interaction between adjacent base
pairs: it is comprised of a harmonic coupling with a
state-dependent coupling constant, effectively modeling the
change in stiffness as the double strand is opened (i.e.,
entropic effects). This nonlinear coupling results in
long-range cooperative effects, leading to a sharp
entropy-driven denaturation transition. The sum in Eq. (1)
is over all base pairs of the molecule, and $y_n$ denotes
the relative displacement from equilibrium of the bases at
the nth base pair. The importance of the heterogeneity of
the sequence is incorporated by assigning different values
to the parameters of the Morse potential, depending on the
base-pair type. The parameter values we have used are taken
from earlier work on the model and were chosen to reproduce
a variety of thermodynamic properties.

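Transfer Integral Method. All equilibrium thermodynamic
properties of the model can be obtained through the
partition function

$$Z = \int \prod_{n=1}^{N} dy_n\, e^{-\beta E(y_n, y_{n-1})} = \int \prod_{n=s}^{s+\kappa-1} dy_n\, Z_\kappa(s), \qquad (2)$$

where $E(y_n, y_{n-1}) = V(y_n) + W(y_n, y_{n-1})$ is the
summand of Eq. (1) and the notation

$$Z_\kappa(s) = \int \Big( \prod_{n \neq s, \dots, s+\kappa-1} dy_n \Big) \prod_{n=1}^{N} e^{-\beta E(y_n, y_{n-1})}$$

has been introduced. $\beta = (k_B T)^{-1}$ is the inverse
thermal energy. In order to evaluate the partition function
(2) using the TIO method, we first symmetrize
$e^{-\beta E(x,y)}$ by introducing [10]

$$S(x,y) = \exp\left( -\frac{\beta}{2} \left[ V(x) + V(y) + 2 W(x,y) \right] \right) = S(y,x).$$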

Here the second equality holds only when x and y
correspond to base pairs of the same kind. Using this
symmetrized kernel, the expression for $Z_\kappa(s)$ is
rewritten as

$$Z_\kappa(s) = \int \Big( \prod_{n \neq s, \dots, s+\kappa-1} dy_n \Big) \Big[ \prod_{n=2}^{N} S(y_n, y_{n-1}) \Big]\, e^{-\frac{\beta}{2} V(y_1)}\, e^{-\frac{\beta}{2} V(y_N)},$$

where open boundary conditions at n = 1 and n = N
have been used. To proceed, a Fredholm integral equation
with a real symmetric kernel,

$$\int dy\, S(x,y)\, \phi(y) = \lambda\, \phi(x), \qquad (4)$$

must be solved separately for the A/T and for the G/C
base pairs.
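A minimal numerical sketch of the eigenvalue problem (4) is given below. It is not the authors' implementation: it assumes a uniform grid with a simple rectangle-rule quadrature, a finite integration range (the lower and upper cutoffs are assumptions), and the commonly quoted PBD parameter values, none of which are specified in this text. The function names are hypothetical.

```python
import numpy as np

KB = 8.617333e-5  # Boltzmann constant in eV/K


def homogeneous_kernel(y, D, a, T=300.0, k=0.025, rho=2.0, b=0.35):
    """Grid version of S(x, y) for a single base-pair type (ASSUMED parameters)."""
    beta = 1.0 / (KB * T)
    V = D * (np.exp(-a * y) - 1.0) ** 2
    X, Y = np.meshgrid(y, y, indexing="ij")
    W = 0.5 * k * (1.0 + rho * np.exp(-b * (X + Y))) * (X - Y) ** 2
    return np.exp(-0.5 * beta * (V[:, None] + V[None, :] + 2.0 * W))


def solve_fredholm(y, D, a, **kwargs):
    """Discretize Eq. (4) and return eigenvalues (descending) and eigenfunctions."""
    h = y[1] - y[0]
    M = h * homogeneous_kernel(y, D, a, **kwargs)  # quadrature weight h per grid point
    lam, vecs = np.linalg.eigh(M)                  # real symmetric matrix
    order = np.argsort(lam)[::-1]
    # Rescale so that sum(phi_i * phi_j) * h = delta_ij on the grid.
    return lam[order], vecs[:, order] / np.sqrt(h)


y = np.linspace(-0.5, 10.0, 300)                   # ASSUMED grid and cutoffs
lam_at, phi_at = solve_fredholm(y, D=0.05, a=4.2)   # A/T base pairs
lam_gc, phi_gc = solve_fredholm(y, D=0.075, a=6.9)  # G/C base pairs
```

The two calls mirror the statement that the equation is solved separately for A/T and G/C pairs; the grid resolution and cutoffs would need to be checked for convergence in any real calculation.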



Since the eigenfunctions are orthonormal and form a
complete basis, Eq. (4) can be used sequentially to replace
all integrals by matrix multiplications in the expression
for $Z_\kappa(s)$. Whenever the sequence heterogeneity
results in a non-symmetric S(x, y), Eq. (4) cannot be used
and we resort to a symmetrization technique, based on
successive introduction of auxiliary integration variables,
as explained in Ref. [13].
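The step from integrals to matrix multiplications can be made explicit with a short worked equation. This is a sketch which assumes that the symmetric kernel admits the spectral expansion in the first line (for instance after imposing an upper cutoff on y, an assumption not stated in the text):

```latex
S(x,y) = \sum_i \lambda_i\, \phi_i(x)\, \phi_i(y),
\qquad
\int dy\, \phi_i(y)\, \phi_j(y) = \delta_{ij},

\int dy\, S(x,y)\, S(y,z)
  = \sum_{i,j} \lambda_i \lambda_j\, \phi_i(x)
    \left[ \int dy\, \phi_i(y)\, \phi_j(y) \right] \phi_j(z)
  = \sum_i \lambda_i^2\, \phi_i(x)\, \phi_i(z).
```

Each further integration over a site of the same base-pair type contributes one more power of the eigenvalue, so a homogeneous stretch of m consecutive kernels reduces to a factor $\lambda_i^m$ in the eigenbasis.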

As noted, in order to quantify the sequence dependence
of DNA's ability to form bubbles of different sizes,
we have previously monitored the frequency of opening
events using Langevin and Monte Carlo simulation
techniques. Since large openings constitute relatively rare
events, such techniques are not efficient (although they
are essential for evaluating dynamical and non-equilibrium
properties). It is much more effective to obtain the
probabilities of large bubbles at a given site in the
sequence directly from the thermodynamic distributions
using the TIO. Importantly, we confirm below that this
equilibrium approach reproduces the bubble locations
observed in Langevin simulations for the same sequences.
This suggests that the bubbles, although large bubbles
are rare events, are governed by equilibrium statistics.

We evaluate the probability $P_\kappa(s)$ for a base-pair
opening spanning κ base pairs (our operational definition
of a bubble of size κ), starting at base pair s, as

$$P_\kappa(s) = Z^{-1} \int_t^{\infty} \prod_{n=s}^{s+\kappa-1} dy_n\, Z_\kappa(s), \qquad (5)$$

where t is the separation (which we have taken as 1.5 Å)
of the double strand above which we define the strand to
be melted.
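The probability in Eq. (5) can be estimated numerically with a direct transfer-matrix quadrature, sketched below. This swaps the eigenfunction expansion described above for a plain grid recursion, so it is an illustration of the same ratio of constrained to unconstrained partition functions rather than the authors' TIO code. The grid, the upper cutoff on y, the parameter values, and the 0-based site index `s` are all assumptions introduced here.

```python
import numpy as np

KB = 8.617333e-5  # Boltzmann constant in eV/K
# ASSUMED Morse parameters (D in eV, a in 1/angstrom), not listed in this text.
PARAMS = {"A": (0.05, 4.2), "T": (0.05, 4.2), "G": (0.075, 6.9), "C": (0.075, 6.9)}


def bubble_probability(seq, s, kappa, t=1.5, T=300.0, y=None,
                       k=0.025, rho=2.0, b=0.35):
    """Estimate P_kappa(s): sites s..s+kappa-1 (0-based) all have y_n > t."""
    if y is None:
        y = np.linspace(-0.4, 20.0, 400)  # ASSUMED grid; results depend on the cutoff
    h = y[1] - y[0]
    beta = 1.0 / (KB * T)
    X, Y = np.meshgrid(y, y, indexing="ij")
    stiffness = 0.5 * k * (1.0 + rho * np.exp(-b * (X + Y)))
    coupling = h * np.exp(-beta * stiffness * (X - Y) ** 2)  # exp(-beta W) dy

    def log_partition(constrained):
        log_z, f = 0.0, None
        for n, base in enumerate(seq):
            D, a = PARAMS[base]
            w = np.exp(-beta * D * (np.exp(-a * y) - 1.0) ** 2)  # exp(-beta V_n)
            if constrained and s <= n < s + kappa:
                w = w * (y > t)           # restrict the integral to y_n > t
            f = w if f is None else w * (coupling @ f)
            norm = f.sum()
            log_z += np.log(norm)         # accumulate in log space to avoid underflow
            f = f / norm
        return log_z

    return float(np.exp(log_partition(True) - log_partition(False)))
```

A call such as `bubble_probability("GGGGGAAAAAGGGGG", s=5, kappa=3)` returns the estimated probability that the first three A/T sites of the soft stretch are simultaneously open. Because the constraint for a larger bubble starting at the same site is strictly more restrictive, the estimate cannot increase with `kappa` at fixed `s`.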

Numerical Results. Using this technique, we are able
to systematically investigate the relation between a given
sequence containing an (apparently) disordered mixture
of A/T and G/C pairs and the probability of spontaneous,
thermally activated bubbles of various sizes. Our
analysis begins with a thorough study of two viral
promoter sequences, the Adeno major late promoter (AdMLP)
and the Adeno-associated viral promoter (AAV P5). We have
previously investigated the thermally induced large
bubbles in these sequences, and found that the opening
profiles obtained through Langevin simulations of the PBD
model agreed remarkably well with the local denaturation
profiles indicated by S1 nuclease experiments (see
Ref. [2] for details). Here we use the TIO to calculate
the probabilities, Eq. (5), for the thermal creation of
bubbles of size 1, 3, 7, and 10 base pairs for the AdMLP
promoter at T = 300 K (Fig. 1). The significant feature
of the sequence is the occurrence of a TATA-box at
base-pair location -30 with 7 consecutive A/T base pairs.
Around +1 there is a rich region containing ~12 A/T base
pairs, which, however, are not located consecutively,
since a comparable number of G/C pairs are alternately
embedded among the A/T pairs. Since A/T base pairs are
more weakly bound (softer) than G/C pairs, we could
reasonably expect that bubbles have a predominant opening
probability in the region around -30. This is indeed the
case for small bubbles (Figs. 1a and 1b). However,
surprisingly, this prediction breaks down when
considering bubbles of larger sizes. The corresponding
probability increases around bp +1 (Fig. 1c), up to the
point that, for a bubble of size 10 bp, it becomes the
highest (Fig. 1d). This finding illustrates the strong
interplay between the sequence of base pairs and the size
of the bubble in the thermal activity of DNA. (Indeed, it
is likely that bubbles of different sizes may initiate
different genetic processes.)

FIG. 1: Probabilities $P_\kappa$ of creating bubbles spanning κ =
1, 3, 7, and 10 bp, respectively, for the Adeno major late
promoter at T = 300 K.

The replacement of 1 or 2 soft A/T with hard G/C base
pairs in specific regions of the DNA can also hugely affect
the probability for the formation of bubbles of given sizes.
This is illustrated with the AAV P5 promoter (Fig. 2).
This sequence regulates AAV gene expression, and
it has been shown [14] to bind the transcription initiator
Yin Yang 1 (YY1) and to be active for TATA-box binding
protein (TBP)-independent transcription. The mutation
of this promoter in which the two A/T bases at +1 and
+2 are replaced by two G/C bases is known to destroy
the binding site for the YY1 initiator and thereby inhibit
transcription. We have previously shown by Langevin
simulations of the PBD model that this mutation also
suppresses the formation of large bubbles around bp +1.
Here we again calculate the probability to obtain bubbles
of various sizes using the TIO. In Fig. 2 we show the
probability of obtaining bubbles of sizes n = 1, 3, 5, and
10 for the wild-type (dashed line) and the mutated (solid
line) AAV P5 promoter. The mutation causes a dramatic
change in the double strand's ability to form large
bubbles at and around the mutated region. However, the
$P_1$ and the $P_{10}$ probabilities are much less affected
by the mutation. Notice that for the wild-type AAV P5
promoter, the region around the TATA-box has the largest
probability for forming large bubbles (panel d). It is
important to note that the AAV P5 promoter has four
A/T-rich regions: four consecutive A/T's around position
-40; seven A/T's around position -30; five A/T's at the
transcription start site +1; and six A/T's around +14. All
these soft regions are clearly discernible in $P_1$
(panel a). From these results on the AAV P5 and AdMLP
promoters we can speculate that the occurrence and
intensity of a peak in the bubble probabilities do not
depend on the specific composition of the DNA fragment.
Rather, there is an essential interplay between the content
of A/T and G/C base pairs and the size of the bubble being
examined.

FIG. 2: Probabilities $P_\kappa$ of creating bubbles of 1 bp (n = 1)
length (a), 3 bp (b), 5 bp (c) and 10 bp (d) length, for the
wild-type (dashed line) and mutant (solid line) P5 promoter.

FIG. 3: Probability $P_\kappa$ for the formation of bubbles of sizes
κ = 1, 4, 7, 10 bps. The sequences are composed of 20 G/C,
5 A/T, 20 G/C followed by different sequences comprising
3, 4, 5, 6, 7 A/T alternating with G/C bps. The last 20 bps are
G/C.

Understanding the mechanisms regulating the DNA
openings is of great importance for predicting and
engineering DNA processes, and we therefore now consider
a series of simple (but experimentally realizable)
DNA sequences where the effects discussed above are
reproduced in detail. Our purpose is to isolate the
underlying mechanisms. Our five sequences are all composed
of 20 G/C, 5 A/T and 20 G/C base pairs. This is followed
by a sequence that alternates A/T and G/C base
pairs. We use 3, 4, 5, 6, and 7 A/T base pairs in the five
sequences. Finally, all five sequences are terminated with
20 G/C base pairs.
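The construction of these test sequences can be sketched in a few lines of Python. The specific letters are assumptions: the text only distinguishes A/T from G/C, so here "A" stands for any soft pair and "G" for any hard pair, and the alternating block is assumed to begin and end with a soft pair. The function name is hypothetical.

```python
def synthetic_sequence(n_at, soft="A", hard="G"):
    """Build 20 hard + 5 soft + 20 hard + alternating block + 20 hard.

    The alternating block contains `n_at` soft pairs separated by single
    hard pairs (for example AGAGA for n_at = 3); this layout is ASSUMED.
    """
    alternating = hard.join(soft for _ in range(n_at))
    return hard * 20 + soft * 5 + hard * 20 + alternating + hard * 20


# The five sequences, indexed by the number of A/T pairs in the second region.
sequences = {m: synthetic_sequence(m) for m in (3, 4, 5, 6, 7)}
```

With this layout the first soft region starts at 0-based site 20 and the alternating region starts at site 45, which is in line with the approximate peak positions +20 and +50 quoted in the discussion of Fig. 3.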

As shown in Fig. 3a, the largest 1 bp opening probability
is localized at +20, a region that contains five
consecutive A/T bases and is therefore expected to be
more susceptible to opening than the region localized
around +50, containing A/T's alternating with G/C's.
However, this simple picture changes dramatically as we
move to larger bubbles (Figs. 3b, 3c and 3d). In all these
cases, the height of the second peak increases as compared
to the peak at +20. With 3 and 4 A/T's, the peaks saturate
at a value lower than the first peak. However, the heights
of the two peaks for the sequence with 5 A/T's become
equal in $P_7$ and remain so for larger bubbles. The
sequence with 6 A/T's shows an inversion of the opening
probability, similar to that observed in the AdMLP
sequence: in $P_{10}$ the most probable 10 base-pair
opening occurs around base-pair location 50. These data
indicate that the opening probability of a bubble of a
given length does not trivially depend on the number of
consecutive A/T's in the DNA subsequence. Instead, bubbles
of size n form with higher probabilities in regions where
the number of A/T's over some characteristic length L(n)
is higher, even if the A/T's are mixed with G/C pairs.

To confirm this hypothesis we have extracted the
characteristic lengths L(κ) as a function of the bubble
size κ from the probability distributions of the simple
sequences considered in Fig. 3. For instance, for κ = 1
(Fig. 3a) we have obtained L(1) = 4 sites, while for
κ = 5 (Fig. 3d) we have L(5) = 10 sites. We have then
considered the AAV P5 promoter DNA sequence of Fig. 2.
Starting from each site s of the sequence, we count the
number $N_\kappa(s)$ of A/T pairs contained over the
corresponding next L(κ) sites. In Fig. 4 (top panel) we
show the results for bubbles of size κ = 1, which can be
compared with Fig. 2a. The small difference between the
mutant and the wild-type opening probability for the sites
around the mutation is well reproduced. The difference
between the wild-type and the mutant sequence is most
pronounced for κ = 5 (Fig. 4, bottom panel), which is also
in agreement with the TIO calculation shown in Fig. 2c.
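The counting procedure described above is simple enough to sketch directly. The function below is a hypothetical helper, not code from the paper: it uses 0-based site indices and truncates windows that run past the end of the sequence, both of which are assumptions.

```python
def at_window_counts(seq, window):
    """Return N(s) for every site s: the number of A/T pairs among the
    `window` sites starting at s (windows are truncated at the sequence end)."""
    soft = [1 if base in ("A", "T") else 0 for base in seq]
    prefix = [0]
    for flag in soft:
        prefix.append(prefix[-1] + flag)  # running total of A/T pairs
    n = len(seq)
    return [prefix[min(s + window, n)] - prefix[s] for s in range(n)]


# Characteristic lengths quoted in the text: L(1) = 4 and L(5) = 10 sites.
example = "GGAATGG"
counts_l1 = at_window_counts(example, 4)  # profile to compare with P_1
```

Comparing such a count profile for the wild-type and mutant sequences, with `window` set to L(1) = 4 or L(5) = 10, corresponds to the two panels of Fig. 4.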

We have also demonstrated that the opening probability
does not depend on the specific distribution of the
A/T pairs contained in the characteristic regions of length
L(κ). This is shown in Fig. 5, where we have calculated
with the TIO the probability $P_{10}$ for the formation of
bubbles of size 10 for sequences where the second A/T-rich
region always contains 7 A/T's distributed in several
different combinations, but always over a 20-base region.
We see that, independently of the distribution of A/T base
pairs, the probability of 10 base-pair bubbles is always
largest in the right-most region. Physically, we interpret
this as the nonlinear coherence dominating (smoothing
out the effect of) the base-pair disorder.

FIG. 4: Number of A/T bps contained in the characteristic
length L(κ), where κ = 1 (top panel), and 5 (bottom panel)
for the wild-type (dashed line) and mutant (solid line) P5 promoter.

FIG. 5: Probability $P_{10}$ for the formation of a bubble of 10
consecutive bps. The sequences consist of 40 G/C, 5 A/T
and 20 G/C followed by 14 bps containing different random
combinations of 7 A/T and 7 G/C. All sequences end with 40
G/C bps.

In summary, we have developed a semi-analytical
technique (TIO) that allows the efficient prediction of
the propensity of a given sequence for thermally induced
bubbles of given sizes. We have found that large thermally
induced bubbles arise through a subtle interplay between
length scales inherent in the nonlinear dynamics and the
sequence disorder. Our results provide new understanding
that can help not only to identify new protein-coding
genes, but also to enable reverse-engineering for use in
future gene therapeutic applications.

This work at Los Alamos National Laboratory is sup- 
ported by the US Department of Energy (contract No. 
W-7405-ENG-36) and by an NIH grant for A.U. (Grant